MLB Hall of Fame Criteria - An Analytical Approach

Author: Ben Davidson

Data Curation, Parsing, and Management

In this section I will demonstrate how to retrieve the dataset from the internet, load it into the Jupyter environment, and tidy it in preparation for analysis.

First, I download a zip file hosted on GitHub that contains the dataset and extract its files into the current directory. Once extracted, I load the .csv files into the environment.
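
As a minimal sketch of the extract-and-load step: the helper name `extract_and_load` and its parameters are hypothetical and are not taken from the original notebook. It takes the archive as raw bytes so that the download itself stays separate.

```python
import io
import zipfile
from pathlib import Path

import pandas as pd


def extract_and_load(zip_bytes, dest_dir, csv_names):
    """Extract an in-memory zip archive and load the named CSV files.

    Hypothetical helper for illustration; returns {file name: DataFrame}.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(dest_dir)

    frames = {}
    for name in csv_names:
        # Archives often nest their files in a folder, so search recursively.
        matches = sorted(Path(dest_dir).rglob(name))
        if not matches:
            raise FileNotFoundError(f"{name} not found in the archive")
        frames[name] = pd.read_csv(matches[0])
    return frames
```

With the real archive, `zip_bytes` could come from `urllib.request.urlopen(url).read()`, where `url` is the address of the zip file on GitHub.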

Now that the data has been downloaded and loaded into pandas dataframes, it is time to wrangle it to fit our needs. For the purpose of formalizing Hall of Fame criteria, the goal is to compose a table that holds all relevant career stats for each player throughout history. The dataset separates different types of stats into different tables: batting, fielding, and pitching each have their own. We will want to combine these so that a player's career is represented by one row in a single table.

I have chosen to separate players into two distinct groups: Pitchers and Position Players. Historically, exceptionally few players have both pitched and played a non-pitching position, and even fewer have done so with any consistency. It therefore makes the most sense to separate the two groups and treat them differently. Pundits also treat them differently and evaluate them by entirely different criteria.

We will also want to add a "HoF" column for each player: a boolean value, "True" or "False", that indicates whether the player is a Hall of Famer. This is necessary for the machine learning section, where a classifier is built on labeled training data.

This is the initial state of the dataframe as given by the .csv file.

In the initial state shown above, each row represents one player's batting stats for one year. I wish to condense this table so that each row represents one player's career statistics. To do this, I take every row for each playerID and sum the stats into one cumulative row for their career. This works because the statistics are all numerical values for which a career total is simply the sum of the yearly values.
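
The condensing step can be sketched on a toy table. The rows are made up, and every column name other than `playerID` is an assumption about the dataset's layout.

```python
import pandas as pd

# Toy season-level batting rows (made-up values; column names other than
# playerID are assumptions for illustration).
batting = pd.DataFrame({
    "playerID": ["aaa01", "aaa01", "bbb02"],
    "yearID": [1990, 1991, 1990],
    "G": [100, 120, 50],
    "H": [90, 110, 30],
    "HR": [10, 15, 2],
})

# The year is an identifier rather than a stat, so drop it before summing.
career_batting = (
    batting.drop(columns=["yearID"])
    .groupby("playerID", as_index=False)
    .sum(numeric_only=True)
)

print(career_batting)
```

Each playerID now appears exactly once, with every stat column holding the career total.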

In this section I will perform the process shown above, but on fielding data. The structure of the data and method for wrangling is very similar.

One feature unique to the fielding data poses a new challenge: a "POS" column that describes the fielder's position. This is problematic because we cannot "sum" qualitative position values over a career in any meaningful way. I chose to handle this by assigning each player a single career position, the one they played most often. This is likely how players are remembered and defined. I determine it from the player's Games Played at each position.

The purpose of the following section is to select each player's most played defensive position. I do another groupby on playerID, but this time, instead of summing all columns, I sum all columns except "POS". For "POS", I use "first" as the aggregation function. This selects the first position value seen, which will be their most played position given the ordering set up above.
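
One way to sketch this two-step aggregation is shown here. How the rows were ordered beforehand is not spelled out in the text, so totaling games per position and sorting in descending order is an assumption, as are the toy data and the `PO` column.

```python
import pandas as pd

# Toy season-level fielding rows (made-up values).
fielding = pd.DataFrame({
    "playerID": ["aaa01", "aaa01", "aaa01", "bbb02"],
    "POS": ["SS", "2B", "SS", "C"],
    "G": [20, 150, 30, 80],
    "PO": [30, 300, 45, 500],
})

# Step 1: total the stats per player and position.
by_pos = fielding.groupby(["playerID", "POS"], as_index=False).sum(numeric_only=True)

# Step 2: order each player's positions by games played, most played first.
by_pos = by_pos.sort_values(["playerID", "G"], ascending=[True, False])

# Step 3: sum every stat column, but keep the first (most played) position.
agg_rules = {col: "sum" for col in by_pos.columns if col not in ("playerID", "POS")}
agg_rules["POS"] = "first"
career_fielding = by_pos.groupby("playerID", as_index=False).agg(agg_rules)

print(career_fielding)
```

Because groupby preserves the row order within each group, "first" picks up the position that the sort placed on top.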

In this section we will merge the tables for career batting and fielding stats to create one all-encompassing table for a player's career. I choose to merge with an outer join in order to preserve any players that may have only batting or only fielding stats. We would not want these players to be removed from our table.
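
A small sketch of the outer join on two toy career tables follows. The values and the `_bat`/`_fld` suffixes are illustrative choices, not taken from the notebook.

```python
import pandas as pd

# Toy career tables: aaa01 has only batting stats, ccc03 only fielding stats.
career_batting = pd.DataFrame({
    "playerID": ["aaa01", "bbb02"],
    "G": [220, 50],
    "H": [200, 30],
})
career_fielding = pd.DataFrame({
    "playerID": ["bbb02", "ccc03"],
    "G": [48, 10],
    "POS": ["C", "P"],
})

# how="outer" keeps players that appear in either table; columns with the
# same name in both tables (here G) get a suffix so neither is overwritten.
career = career_batting.merge(
    career_fielding, on="playerID", how="outer", suffixes=("_bat", "_fld")
)

print(career)
```

Players missing from one side keep their row, with NaN in the columns that came from the other table.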

As the last step of the wrangling process we must define whether a player has been inducted into the MLB Hall of Fame for the sake of machine learning and classification. I do this by creating a dictionary that maps every MLB player in history to a boolean that will be "True" if they are a Hall of Famer and "False" if they are not.
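
The labeling step might look like the sketch below. The `hall` table and its `inducted` column holding "Y"/"N" are assumptions about how induction records are stored, and all rows are made up.

```python
import pandas as pd

# Hypothetical induction records: one row per ballot appearance.
hall = pd.DataFrame({
    "playerID": ["aaa01", "aaa01", "bbb02"],
    "inducted": ["N", "Y", "N"],
})

# Toy career table to be labeled.
players = pd.DataFrame({
    "playerID": ["aaa01", "bbb02", "ccc03"],
    "G": [2400, 900, 15],
})

# A player counts as a Hall of Famer if any of their records is an induction.
inducted_ids = set(hall.loc[hall["inducted"] == "Y", "playerID"])

# Map every player in the career table to True or False.
hof_lookup = {pid: pid in inducted_ids for pid in players["playerID"]}
players["HoF"] = players["playerID"].map(hof_lookup)

print(players)
```

Players who never appear in the induction records, such as ccc03 here, are labeled "False" rather than left missing.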

We have now created the table for position players! Great, now it is time to create the table for pitchers. This will involve similar processes but will be notably easier because all stats are found in one table rather than being divided between two tables for batting and fielding as for position players.

We begin by viewing the initial state of the table.

I begin by removing unnecessary columns and summing career statistics for each player as we did for position players.

We now have career stats for all the pitchers! All that's left to do is to add a HoF column in the same way we did before.

Exploratory Data Analysis

In this section we will do some initial exploratory analysis of the data we've cleaned in order to investigate suspected relationships in the data. We will use some libraries to assist with regression as well as data visualization. Visualizing the data is a great way to help us with EDA, as we can spot relationships much more easily in well-composed graphs than in very large tables.

Let's begin our investigation into what we think may cause players to be thought of as Hall of Fame quality. Players who are consistently good are much more likely to be inducted than those who are exceptional for a short period of time. For this reason I believe time in the league is a very important quality. Let's begin by exploring how games played (position players) and innings pitched (pitchers) affect Hall of Fame induction.
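
As a first numerical look, before any plotting, one could compare the typical career length of the two groups. The table below is made up purely to show the mechanics and says nothing about the real data.

```python
import pandas as pd

# Made-up career table with the boolean HoF label.
players = pd.DataFrame({
    "playerID": ["a", "b", "c", "d", "e", "f", "g"],
    "G": [2500, 2200, 3000, 100, 900, 1500, 400],
    "HoF": [True, True, True, False, False, False, False],
})

# Median games played for Hall of Famers and for everyone else.
median_hof = players.loc[players["HoF"], "G"].median()
median_other = players.loc[~players["HoF"], "G"].median()

print(f"Median games, Hall of Famers: {median_hof}")
print(f"Median games, other players:  {median_other}")
```

The same comparison applies to pitchers by swapping games played for an innings-pitched column.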